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Cross Reference To Related Applications 

This application claims priority under U.S.C. 119(e) from U.S. Provisional 
Application No. 60/255,562, entitled "High Speed DES Engine" as of filing on 
10 December 13, 2000, the disclosure of which is herein incorporated by reference for 
all purposes. 

Background of the Invention 

15 1. Field of the Invention. 

The present invention relates to implementing a cryptography engine. More 
specifically, the present invention relates to methods and apparatus for efficient 
performance of a cryptography algorithm such as a DES algorithm. 

20 

2. Description of Related Art 

Conventional software and hardware designs for implementing various 
cryptography algorithms including the DES and the triple DES algorithms have 

25 significant performance limitations. Many designs for performing cryptography 
processing are well known in the art and are discussed, for example, in Applied 
Cryptography, Bruce Schneier, John Wiley & Sons, Inc. (ISBN 0471128457), 
incorporated by reference in its entirety for all purposes. In order to improve the 
speed of cryptography processing, specialized cryptography accelerators have been 

30 developed that typically out-perform similar software implementations. Examples of 
such cryptography accelerators include the BCM™ 5805 manufactured by Broadcom, 
Inc. of San Jose, CA. 
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Although specialized hardware cryptography accelerators can often 
outperform software written to perform the same tasks, conventional hardware 
cryptography accelerators have significant performance bottlenecks for implementing 
cryptography algorithms including the DES and the triple DES algorithms. Some 
performance bottlenecks include limitations related to key scheduling, clock rate, 
clock synchronization, and core logic operations. 

It is therefore desirable to provide methods and apparatus for improving the 
implementation of cryptography algorithms with respect to some or all of the 
performance limitations noted above. 
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Summary of the Invention 

Methods and apparatus are provided for implementing a cryptography engine 
for cryptography processing. A variety of techniques are described. A cryptography 

5 engine such as a DES engine can be decoupled from surrounding logic by using 
asynchronous buffers. Bit-sliced design can be implemented by moving expansion 
and permutation logic out of the timing critical data path. An XOR function can be 
decomposed into functions that can be implemented more efficiently. A two-level 
multiplexer can be used to preserve a clock cycle during cryptography processing. 

10 Key scheduling can be pipelined to allow efficient round key generation. 

In one embodiment, a cryptography engine for performing cryptographic 
operations on a data block is provided. The cryptography engine includes a key 
scheduler configured to provide keys for cryptographic operations, a two-level 

15 multiplexer, expansion logic, and permutation logic. The expansion logic is coupled 
to the input stage of the multiplexer circuitry. The expansion logic is configured to 
expand a first bit sequence having a first size to an expanded first bit sequence having 
a second size greater than the first size. The first bit sequence corresponds to a 
portion of the data block. The permutation logic is coupled to the expansion logic. 

20 The permutation logic is configured to alter a second bit sequence corresponding to 
the portion of the data block, whereby altering the second bit sequence performs 
cryptographic operations on the data block. 

The cryptography engine can also include an Sbox configured to alter a third 
25 bit sequence corresponding to the portion of the data block by compacting the size of 
the third bit sequence and altering the third bit sequence using Sbox logic. In one 
embodiment, the cryptography engine is a DES engine. The cryptography engine can 
have two 2-to-l multiplexers on the first level coupled to two 2-to-l multiplexers on 
the second level. The two-level multiplexer can be configured to select either initial 
30 data, swapped data, or non-swapped data to provide to the output stage of the 
multiplexer. 
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According to another embodiment, an integrated circuit layout associated with 
a cryptography engine for performing cryptographic operations on a data block is 
provided. The integrated circuit layout can provide information for configuring the 
cryptography engine. The integrated circuit layout includes a key scheduler 
5 configured to provide keys for cryptographic operations, a two-level multiplexer, 
expansion logic and permutation logic. The expansion logic is coupled to the input 
stage of the multiplexer circuitry and is configured to expand a first bit sequence 
having a first size to an expanded first bit sequence having a second size greater than 
the first size. The first bit sequence corresponds to a portion of the data block. The 
10 permutation logic is coupled to the expansion logic. The permutation logic is 
configured to alter a second bit sequence corresponding to the portion of the data 
block, whereby altering the second bit sequence performs cryptographic operations 
on the data block. 

15 According to other embodiments, a cryptography engine for performing 

cryptography operations on a plurality of packets is provided. The packets can have 
payloads and payload gaps. The cryptography engine includes a DES engine, an 
asynchronous input buffer coupled to the cryptography engine input; and surrounding 
logic coupled to the DES engine through the asynchronous input buffer, wherein the 

20 DES engine operates at a first clock rate and the surrounding logic operates at a 
second clock rate different from the first clock rate. 

These and other features and advantages of the present invention will be 
presented in more detail in the following specification of the invention and the 
25 accompanying figures, which illustrate by way of example the principles of the 
invention. 
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Brief Description of the Drawings 

The invention may best be understood by reference to the following 
description taken in conjunction with the accompanying drawings. 

5 

Figure 1 is a diagrammatic representation of a system that can use the 
techniques of the present invention in accordance with one embodiment of the present 
invention. 

Figure 2 is a diagrammatic representation of a cryptography engine having 
10 surrounding logic and a DES engine in accordance with one embodiment of the 
present invention. 

Figure 3 is a diagrammatic representation of a packet that the DES engine 
may receive in accordance with one embodiment of the present invention. 

Figure 4 is diagrammatic representation of a DES engine in accordance with 
15 one embodiment of the present invention. 

Figure 5 is a diagrammatic representation of a DES engine that can be used 
for bit-sliced implementation in accordance with one embodiment of the present 
invention. 

Figure 6 is a diagrammatic representation of a bit-sliced implementation that 
20 can be used in Figure 5 in accordance with one embodiment of the present invention. 

Figure 7 is a diagrammatic representation of an optimized bit-sliced 
implementation in accordance with one embodiment of the present invention. 
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Detailed Description of Specific Embodiments 

Reference will now be made in detail to some specific embodiments of the 
invention including the best modes contemplated by the inventors for carrying out the 

5 invention. Examples of these specific embodiments are illustrated in the 
accompanying drawings. While the invention is described in conjunction with these 
specific embodiments, it will be understood that it is not intended to limit the 
invention to the described embodiments. On the contrary, it is intended to cover 
alternatives, modifications, and equivalents as may be included within the spirit and 

10 scope of the invention as defined by the appended claims. In the following 
description, numerous specific details are set forth in order to provide a thorough 
understanding of the present invention. The present invention may be practiced 
without some or all of these specific details. In other instances, well known process 
operations have not been described in detail in order not to unnecessarily obscure the 

15 present invention. 

Techniques are provided for enabling the implementation of efficient 
cryptographic processing, such as DES and triple DES processing. DES specifies 
encrypting individual 64-bit data blocks. A 64-bit data block of unencrypted data is 

20 provided to the DES engine, combined with a key, and output as a 64-bit data block 
of encrypted data. The key used for DES processing is typically a 56-bit number, 
although the key can be expressed as a 64-bit number. DES describes breaking up a 
64-bit block of data into a right half and a left half, each 32-bits long. As will be 
appreciated by one of skill in the art, sixteen rounds of cryptographic operations are 

25 then performed. In each round, operations on the right half of the data include 
expansion, permutation, Sbox operations, and combination with a round key. A 
round key can be determined based on the round number of DES processing is 
referred to herein as a round key. The round key can be derived by applying 
permutation and shift functions to all 56 bits of the original key. The round number of 

30 DES determines the shift amount. 

An XOR function is used to combine the right half of the data with a version 
of the key based on the round. The result can then be combined with the left half also 
by way of an XOR function. The unprocessed right half of the data becomes the left 
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half of the data for the next round. Triple DES specifies performing three 16 round 
DES operations consecutively using three different keys. Typical hardware 
implementations for DES or triple DES perform one round calculation per clock 
cycle. The performance of DES or triple DES engine is therefore determined by the 
5 clock rate. However, a variety of hardware design constraints have traditionally 
limited the clock rate of cryptographic processing engines. 

One design constraint that has traditionally limited the clock rate of 
cryptographic processing engines is the clock rate of circuitry surrounding the 

10 cryptographic processing unit. Surrounding logic having a slower clock rate than the 
DES engine is typically unable to provide data at a sufficient rate for DES processing. 
Methods and apparatus are described for decoupling the DES engine and the 
surrounding logic. The decoupled DES engine and surrounding logic allow flexibility 
in setting different clock ratios. The flexibility allows the DES engine to be designed 

15 to operate at a much higher clock rate than the surrounding logic. 

Other design constraints relate to resistance capacitance (RC) delay. As will 
be appreciated by one of skill in the art, RC delay is a significant limitation in the 
performance of high-speed logic chips. RC delay in high speed circuit design can 

20 result from the resistance of interconnections and the capacitance associated with the 
dielectric media. RC delay is especially pronounced in sub 0.18 micron technologies. 
One way of limiting the effects of RC delay is to use bit-sliced design. That is, data 
paths are made narrower to allow more efficient custom-designs and layouts. In one 
example, the 32-bit data paths used in DES cryptography processing would be split 

25 into eight 4-bit data paths using efficiently designed layouts. Bit-sliced design, 
however, can not be used in conventional DES cryptography processing because the 
32-bit data paths include expansion and permutation operations. The 32-bit blocks in 
the data paths are typically expanded, combined with a key, provided to an Sbox, and 
permuted. Bit-sliced design cannot be used because the bits in the 32-bit data blocks 

30 are intermingled. In other words, the bits in one four bit-slice would typically be 
swapped with bits from other four bit-slices because the bits are periodically 
permuted. Because there is no easy mechanism for keeping track of the bit swapping, 
bit-sliced design can not easily be implemented. Techniques are described for 
moving the permutation and expansion operations out of the 32-bit data path used in 
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DES cryptography processing. By moving the permutation and expansion operations 
out of the data path, bit-sliced implementation can be used to limit the effects of RC 
delay in the time critical DES processing data path. 

5 DES also describes the use of XOR operations in cryptography processing. 

As will be appreciated by one of skill in the art, XOR operations are computationally 
expensive. Techniques are provided for using more fundamental operations to replace 
XOR operations in the data path. 

10 As noted above, the DES algorithm specifies sixteen rounds of processing. 

Triple DES performs the DES algorithm three consecutive times. During each round 
of DES processing, the left and right halves of the 64-bit data are swapped. However, 
in triple DES, the left and right halves of the 64-bit data are not swapped during the 
transitions between the consecutive performance of 16 round DES operations. To 

15 allow non-swapping of the left and right halves, logic that takes an extra clock cycle 
and a 3-to-l multiplexer is typically used. Techniques using a two level 2-to-l 
multiplexer are provided that do not take a extra clock cycle during the transitions 
between the sixteen rounds of DES processing. As will be appreciated by one of skill 
in the art, 2-to-l multiplexers can be integrated together with registers. 

20 

The present invention also contemplates pipelined key scheduling to allow a 
round key to be provided to each DES round at a high clock rate. The key generation, 
selection, provision, and consumption operations can be separated into different 
stages for pipelined processing. 

25 

The techniques of the present invention can be implemented in a variety of 
contexts. Figure 1 describes the present invention in terms of a cryptographic 
accelerator system that can be implemented in a number of ways, such as for 
example, as a stand alone integrated circuit, as embedded software, or as a subsystem 
30 included in, for example, a server computer used in a variety of Internet and Internet 
related activities. It should be noted, however, that the invention is not limited to the 
described embodiments and can be used in any system where data encryption is 
desired. 
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Figure 1 is a diagrammatic representation of one example of a cryptographic 
processing system 100 in accordance with an embodiment of the invention. As 
shown in Figure 1, the present invention may be implemented in a stand-alone 
cryptography accelerator 102 or as part of the system 100. In the described 
embodiment, the cryptography accelerator 102 is connected to a bus 104 such as a 
PCI bus via a standard on-chip PCI interface. The processing system 100 includes a 
processing unit 106 and a system memory unit 108. The processing unit 106 and the 
system memory unit 108 are coupled to the system bus 104 via a bridge and memory 
controller 110. Although the processing unit 106 may be the central processing unit 
or CPU of a system 100, it does not necessarily have to be the CPU. It can be one of 
a variety of processors. A LAN interface 114 couples the processing system 100 to a 
local area network (LAN) and receives packets for processing and writes out 
processed packets to the LAN (not shown). Likewise, a Wide Area Network (WAN) 
interface 112 connects the processing system to a WAN (not shown) such as the 
Internet, and manages in-bound and out-bound packets, providing automatic security 
processing for IP packets. 

A cryptography accelerator 102 can perform many cryptography algorithms 
including the DES and the triple DES algorithms. It should be noted that although the 
techniques of the present invention will be described in the context of DES and triple 
DES for clarity, one of skill in the art will appreciate that the techniques can also 
apply to other cryptography algorithms including variations to the DES and triple 
DES algorithms. 

Figure 2 is a diagrammatic representation of a cryptographic processing unit 
in accordance with one embodiment of the present invention. The cryptographic 
processing unit 201 has input and output buffers 203 and 205. The input and output 
buffers 203 and 205 may be coupled to an external processor as shown in Figure 1. 
The cryptographic processing unit 201 contains a cryptography engine 207. The 
cryptographic processing unit 201 can also contain an authentication engine 217. As 
will be appreciated by one of skill in the art, other components including decoding 
logic may also be components in the cryptographic processing unit 201. The 
cryptography engine 207 contains surrounding logic 209 and DES engine 213. DES 
engine 213 can include asynchronous FIFO interfaces 211 and 215. The 
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asynchronous FIFO interfaces 211 and 215 allow DES engine 203 to operate at a 
much higher clock rate than the surrounding logic 209. The asynchronous FIFO 
interface 211 also can be used to convert 32-bit blocks received from the surrounding 
logic 209 to 64-bit blocks for DES processing. The asynchronous FIFO interfaces 
5 215 similarly he can be used to convert the 64-bit blocks for DES processing back to 
32-bit blocks for the surrounding logic 209. 

The asynchronous FIFO interfaces 211 and 215 are provided to decouple the 
operations of the surrounding logic 209 in the DES engine 213. The surrounding 

10 logic 209 can parse the data to allow processing of portions of the data by the DES 
engine 213. The DES engine 213 can process a data block whenever it is available at 
the input FIFO. The FIFO interfaces 211 and 215 compensate for the rate mismatch 
between the surrounding logic 209 and the DES engine 213. The asynchronous FIFO 
interfaces provide flexibility for setting different clock ratios between DES engine 

15 213 and surrounding logic 209 and allow reuse of the same DES engine 213 design in 
other processor configurations. In one example, the clock ratio between the DES 
engine 213 and the surrounding logic 209 is 3 to 1. In this example, the DES engine 
can operate at about 500 MHz while the surrounding logic can operate at about 166 
MHz. The FIFO interface can be sized to provide sufficient data for the DES engine 

20 to process. The clock rate of the surrounding logic 209 as well as the latency to load 
and unload data for the DES engine 213 can be considered. In one example, the input 
FIFO 211 must be large enough to continually provide DES engine 213 with data. 
Each data block the surrounding logic processes can contain data that the DES engine 
213 processes. 

25 

While this embodiment of the invention is described with respect to a DES 
engine, the invention may be applied more generally to other cryptography engines 
such as AES. 

30 Figure 3 is a diagrammatic representation of data packets that the surrounding 

logic 209 may send to cryptography engine 207 by way of the FIFO interface 203 in 
accordance with one embodiment of the present invention. Packet 301 contains a 
header portion 303. Portion 305 contains keys and an initialization vector. 
Initialization vectors (IV) are described in RFC 2405 titled The ESP DES-CBC 
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Cipher Algorithm With Explicit IV, the entirety of which is incorporated by reference 
for all purposes. 'CBC' stands for cipher block chaining mode. This mode specifies 
that the cryptographic processing of 64-bit data blocks should be chained together in 
a way such that the result of last 64-bit data becomes the IV for processing the block 
follows. CBC mode can prevent a data block from being processed in parallel in 
many conventional systems. 

Portion 307 contains the payload. The payload is the portion of packet 301 
that is provided to DES engine 213 for DES processing. The payload 307 can be 
followed by a trailer 309. The payload header 303, keys and initialization vector 
portion 305, and trailer 309 constitute the payload gap 311 associated with the data 
from surrounding logic 209. Data in the payload portion of the packet can be 
continually provided to the DES engine 213 by determining the payload gap 311 and 
the clock rate of the surrounding logic 209. 

As noted above, the DES engine 213 may be processing at 500 MHz while the 
surrounding logic 209 may only be able to provide data to DES engine 213 at a rate 
of about 166 MHz. The FIFO interface 211 can comprise registers also clocked at 
166 MHz. Assuming that a single 32-bit word is bypassed every 166 MHz clock 
cycle, it would take 50 clock cycles at 166 MHz to bypass a payload gap 311 of 50 
words. According to various embodiments, it takes 50 clock cycles to process a 64- 
bit block at 500 MHz. The 50 clock cycles at 500 MHz correspond to approximately 
17 clock cycles at 166 MHz. The FIFO interface 21 1 would be configured to be large 
enough to hold at least four 32-bit blocks. That is, 64-bits of payload can be available 
in the FIFO interface 211 before DES engine 213 completes processing of a single 
64-bit block. In one embodiment, the FIFO is sized to hold eight 32-bit blocks. Two 
32-bit words are used to hold data between packets so that the DES engine knows the 
end of one payload. Therefore, six blocks are available for payload. In some 
implementations, the payload gap can exceed 50 words. Gaps can include the 
following entries: MCW (1 32-bit word), Host Context (4 words), SA data (keys, 18 
words), Buffer Descriptor (3 words), Ipv4 header and options (15 words), ESP header 
(IV, 4 words), ESP trailer (3 words), Status trailer (4 words), and a variable length 
Ethernet header and trailer. 
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Figure 4a is a diagrammatic representation of a DES engine 213 as shown in 
Figure 2. The DES engine can include an input FIFO 401. The input FIFO 401 can 
be asynchronous to decouple the DES engine from surrounding logic. A 64-bit data 
block is combined with an initialization vector from initialization vector block 405. 
Initialization vectors are described in RFC 2405 as noted above. The 64-bit block 
then undergoes an initial permutation at 407. The initial permutation occurs before 
round 1. It should be noted that in some variations to DES, to which the present 
invention is applicable, initial permutation and final permutation operations are not 
performed. The 64-bit block is then passed to two-level multiplexer stage 409. Two- 
level multiplexer stage 409 contains four multiplexers for determining whether to 
load initial data, swap data from the previous round, or not swap data from the 
previous round. Initial data is loaded in the first round of DES processing. Data is 
swapped between rounds of DES processing. Data is not swapped in triple DES 
between the completed 16 rounds of DES processing. Control logic (not shown) can 
track the round number in order to determine what signals to send to the multiplexers. 
The first level multiplexers are sent a signal to load or not to load data from a 
previous round or to allow initial data to pass through after initial permutation 407. 

The second level multiplexers are provided a control signal instructing the 
multiplexers to swap or not swap the feedback data or load the initial data. The initial 
data is never swapped. The control signal can be the load value provided in the first 
level combined by a bitwise OR with the swap value in the right half and combined 
with the one's complement of the swap value in the left half. In a four-bit data path 
example, the load value is 1 and the swap value is 0. It should be noted that the 
control signals can be a variety of numbers including a sequence of ones or a 
sequences of zeros. In the case of loading initial data, initial data is loaded into the 
first level of multiplexers and provided to the second level of multiplexers. It can pass 
the second level of multiplexers if the load value is 1. According to various 
embodiments, the load value of 1 combined with a bitwise OR with the swap value of 
0 is 1. The load value of 1 combined with a bitwise OR with the swap value of 1 is 
also 1. In the case of load feedback data, the load value is 0. The feedback data is 
swapped according to the value of signal swap. Registers 411 and 413 receive the 
initial data or the feedback data. 
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After the first round of processing, the load value is switched to 0. The swap 
value is now 1 to indicate that the left and right halves should be switched. Data is 
passed back along line 453 to the first level of multiplexers 409. Since the load value 
is 0, the data passed back along line 453 is selected and provided to the second level 
of multiplexers. The (ld|~swap) value for the multiplexer corresponding to register 
411 is (0|0) so the right half passed back along line 451 is selected and provided to 
register 411. The (ld|swap) value for the multiplexer corresponding to register 413 is 
(0|1) so the data provided from the first level multiplexer corresponding to data from 
line 453 is sent to register 413. 

One benefit of the two level multiplexers is that the timing critical 32-bit data 
is arranged to go through only the second level multiplexer, which can be combined 
with the registers 41 1 and 413 to minimize the delay. 

As will be appreciated by one of skill the art, various control signals. An 
alternative approach would be to use a 3-to-l single level multiplexer to select either 
the initial data, the swapped feedback data, or the non-swapped feedback data. The 
use of a 3-to-l single level multiplexer however adds significant delay to the timing 
critical datapath because the 3-to-l multiplexer can not be easily combined with the 
registers. The 3-to-l multiplexer uses an extra clock cycle. According to a preferred 
embodiment, four 2-to-l multiplexers are used. The 2-to-l multiplexers in the second 
level of the multiplexer stage 409 can be integrated with the registers 411 and 413. 
Register 411 contains the last half of this initial 64-bit block in round 1. Register 413 
contains the right half of the 64-bit block in round 1. Registers 411 and 413 both 
typically hold 32-bits of data. The 32-bit data block contained in register 413 is 
provided to both expansion stage 415 and to register 411 through multiplexer stage 
409 for the next round. Control signals in multiplexer stage 409 are configured to 
provide the 32-bit data block contained in register 413 to register 411 in the next 
round of DES processing. The 32-bit data block is provided to expansion logic 415. 

As will be appreciated by one of skill in the art, the expansion logic 415 
changes the order of the bits in the 32-bit at block and also repeats certain bits. The 



BRCMP013A/BP1688.1/JEA/GKK 



expansion logic 415 uses the 32-bit block to generate a 48-bit block. The expansion 
logic improves the effectiveness of the encryption process and also makes the 32-bit 
block into a 48-bit block that corresponds to the size of the key. The 48-bit block can 
then be combined with an XOR with the 48-bit round key at 417. 

Keys are provided by key generation logic or key scheduler circuitry 419. A 
version of the key for cryptography processing of the original 64-bit block is provided 
by key scheduler 419. Key scheduler 419 can provide a different version of the 
original key for every round by applying permutation and shift functions to all 56 bits 
of the original key. The 48-bit block resulting from the XOR at 417 is provided to 
Sbox stage 427. As will be appreciated by one of skill the art, each Sbox in Sbox 
stage 427 converts a six-bit input into a four-bit output. According to various 
embodiments eight Sboxes are provided in Sbox stage 427. Sboxes as well as other 
cryptography operations are described in Applied Cryptography, Bruce Schneier 
(ISBN 0471128457), the entirety of which is incorporated by reference for all 
purposes. 

The 32-bit output of Sbox stage 427 is provided to permutation stage 429. A 
permutation stage 429 maps input bits in certain positions to different output 
positions. In one example, bit 1 is moved to position 16 of the output, bit 2 is moved 
to position 7 of the output, and bit 3 is moved to position 20 of the output, etc. The 
32-bit output of permutation stage 429 is combined with an XOR with the value in 
register 411 at 431. The result of the XOR is provided to the register 411 through 
multiplexer stage 409 for the next round of DES processing. The same process 
occurs for the subsequent rounds of DES processing. That is, the right half is 
expanded, combined with an XOR function with a version of the key, provided to an 
Sbox stage, permuted, and combined with an XOR with the left half. After the last 
round, the outputs are written to register 433 and register 435. The output can then 
undergo a final permutation at 437. The result of a final permutation at 437 is 
combined by way of an XOR with an initialization vector as noted above when the 
DES engine is used to decrypt data. Otherwise, the result of the final permutation at 
437 can remain unchanged by combining by way of an XOR with a sequence of 
zeros. 
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For triple DES, the outputs at 433 and 435 are passed back to multiplexer 
stage 409. Control circuitry determines how to pass the data back to register 411 and 
413 for a next 16 rounds of DES processing. In typical implementations, an extra 
5 clock cycle is required to determine whether to swap the left and right halves or not to 
swap the left and right halves. According to the techniques of the present invention, 
however, two-level multiplexer stage 409 avoids the extra clock cycle. Data can be 
fed back to multiplexer stage 409 as if it were another round of DES processing. 
Control logic, allows the proper determination of how to load the data based on 
10 whether it is another DES round or a transition before another 16 round DES 
algorithm. More specifically, the swap value can be 0 for transitions. The processed 
data resulting from the DES or triple DES algorithm is provided to output FIFO 441. 
Output FIFO 441 can be asynchronous FIFO like input FIFO 401. 

15 Figure 4b is a diagrammatic representation of a DES engine in accordance 

with one embodiment of the present invention showing critical and non-critical paths 
in the round processing logic. A block of 64-bit data is input through multiplexer 
stage 409 into left register 411 and right register 413. The 32-bit data in right register 
413 is provided to left register 411 for the next round of DES processing and to 

20 expansion block 415 as noted above in Figure 4a. The 48-bit output is combined 
using an XOR function with the round key stored in the key register 419 at 417. The 
48-bit data block is then provided to Sbox stage 427 and permuted at 429. The logic 
operations performed such as XOR 417 and Sbox 427 on the right half data path are 
resource intensive and make path 451 a timing critical path. The timing critical path 

25 451 can be implemented using bit-sliced design. Path 453 simply feeds the right half 
data back to the first level multiplexer. Path 453 contains less logic than path 451 and 
is therefore referred to as a non-critical path. 

As noted above, bit-sliced implementation cannot be performed because the 
30 permutation 429 and expansion logic 415 change the order of bits in the 32-bit data 
block. It would be difficult to split a 32-bit data path into eight separate four bit slices 
or a 48-bit data path into eight separate six bit slices because the bits in the timing 
critical data path are intermingled by the expansion and permutation logic. The 
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techniques of the present invention, however, contemplate moving the permutation 
and expansion logic operations out of the timing critical data path. By moving the 
expansion and permutation logic outside the timing critical data path, six bit slices 
can be custom-designed for combination with round keys and input into Sboxes. As 
5 noted above, Sboxes take six bit inputs and provide four bit outputs. 



The permutation and expansion logic are removed from the timing critical 
data path by recognizing the characteristics of the permutation logic. DES specifies 
that the permutation logic move bits from certain input positions to certain output 
10 positions. For example, the four bit sequence on 1, 0, 1, 1 can be permuted by 
moving bit 1 to bit 3, bit 2 to bit 4, bit 3 to bit 2, and bit 4 to bit 1. The resulting bit 
sequence would be 1, 1, 1,0. The inverse permutation would move bit 1 to bit 4, bit 
2 to bit 3, bit 3 to bit 1, and bit 4 to bit 2. The inverse permutation applied to the 
resulting bit sequence 1,1,1,0 would yield the original bit sequence 1,0, 1, 1. 

15 

It can be recognized that taking the inverse permutation of the permutation of 
the bit sequence yields the original bit sequence. The relationship can be expressed 
as the following equations: 

20 bitsequence = P~ l (P(bitsequence)) Equation 1 

bitsequence = P(P _1 (bitsequence)) Equation 2 

Equations 1 and 2 can be applied to the DES engine shown in Figure 4. The L 
value at 463 is combined with an XOR the permutation of R value at 461 the yield T 
25 value at 465. The relationship can be shown with the following set of equations: 



30 



T = P(R) XOR L 

T = P(R)XOR P(P 4 (L)) 

T = P(R XOR P _1 (L)) Equation 3 



From the equation, it can be recognized that the R value at 461 does not need 
to be permuted until after it is combined with an XOR with the modified L value. 
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Since the R value does not need to be permuted, the permutation logic can be moved 
out of the timing critical data path. 

Figure 5 is a diagrammatic representation of a DES engine in accordance with 
5 one embodiment of the present invention where the expansion and permutation logic 
are moved out of the timing critical data path. The T value at 515 corresponds to the 
T value at 465 in Figure 4. That is, T = P(R XOR P _1 (L)). More specifically, the L 
value at 513 is inverse permuted by inverse permutation logic 501. The R value at 
511 is combined with an XOR with the inverse permutation of L value at 513. The 
10 result is then permuted by permutation logic 517 to yield T value at 515, 

The data provided to the DES engine shown in Figure 5 is split into a right 
and left half. The left half is provided to inverse permutation logic 501. The right 
half is provided to both inverse permutation logic 503 and expansion logic 505. The 
15 right half provided to inverse permutation logic 503 is passed to register 519 through 
two-level multiplexer 507. It should be noted that pulling the permutation and 
expansion logic out of the timing critical data path can be performed with or without 
using a two-level multiplexer. 

20 The right half at register 519 is provided to register 521 for the next round of 

DES processing. The left half is provided to register 521 through two-level 
multiplexer 507. The left half or L value at 513 is combined with an XOR function 
with the R value at 511. The right half provided to expansion logic 505 is input into 
register 523 through two-level multiplexer 509. The value at 527 is then combined 

25 with an XOR with the round key and subsequently provided to Sbox stage 527. The 
result is R value at 511. The result is then fed back for the next round of DES 
processing. As noted above the two-level multiplexer determines whether to load the 
initial data, swap data, or not swap data. Since the permutation logic 517 and 
expansion logic 505 are both outside of the timing critical path, bit-sliced 

30 implementation can be used. 

Figure 6 is a diagrammatic representation showing a bit-sliced implementation 
that can be used in accordance with one embodiment of the present invention. The 



BRCMP013A/BP1688. 1/JEA/GKK 



17 



value di_l is one of the inputs to first level multiplexer 601. The value di_rs is one of 
the inputs to first level multiplexer 603. The value di_r is one of the inputs to first 
level multiplexer 605. The value rd_slice_di is one of the inputs to second level 
multiplexer 611. The value di_r_mux is the output value of multiplexer 605 and the 
5 other input to multiplexer 611. The value Id is the control signal provided to the 
multiplexers to determine what input should be provided to output. The Id value can 
be a sequence of ones or a sequence of zeros. The value sellout is the output of the 
multiplexer 611 provided to register 619. The value sel_out is combined with an 
XOR function with a round key at 621 to yield value sbox_di. According to a 
10 preferred embodiment, di_r_mux, rd_slice_di, sel_out, and sbox_di are all six-bit 
sequences. After sbox_di is provided to Sbox 625, a four-bit sequence sbox_do is 

y output. Sbox_do is combined with an XOR with the four bit value from register 615 

CD to output rd_slice_do. 



15 Bit-sliced implementation is possible because the permutation and expansion 

logic has been moved out of the data path corresponding to rd_slice_di and di__r. 
Using a six bit slice implementation, circuitry can be custom-designed to limit the 
effects of RC delay in the data path corresponding to rd__slice_di and di_r. 

20 As will be appreciated by one of skill in the art, RC delay is not the only 

factor limiting throughput. The XOR function is also typically slow. The data path 
can be further optimized by performing the XOR function together with the functions 
of the two multiplexers 605 and 611. Although XOR functions are typically slow, 
OR functions can be implemented more efficiently. The XOR function can be 

25 decomposed into a single OR function used with other logic operations. The 
decomposition can be accomplished by understanding the characteristics of the data 
at various logic operations. The sel_out value in register 619 is either di__r_mux or 
rd_slice_di. The determination is made by control signal sel equivalent to (ld|swap). 
That is, the di_r_mux value is selected if it is round one. The rd_slice_di value is 

30 selected if it is not round one. The swap value is based on whether the current data is 
in transition before a 16 round DES operation in triple DES. The selection can be 
represented by the following equations, where ~ is the one's complement, & is a 
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bitwise AND, A is an XOR, and | is a bitwise OR. The value sbox_di can be 
represented by the following equations: 

sel_out = (sel & di_r_mux | -sel & rd_slice_di) Equation 4 

5 

Where sel is 111111, di_r_mux is selected as sel_out. Where sel is 000000, 
rd_slice_di is selected as sellout. The value sellout XORed with the key yields 
sbox_di, shown by the following equation: 

10 sbox_di = (sel & di_r_mux | -sel & rd_slice_di) A key Equation 5 

An XOR function (a A b) can be decomposed as (a & ~b | -a & b). Equation 5 
can be decomposed in the following manner: 

15 sbox_di = (sel & di_r_mux | -sel & rd_slice_di) & -key | 

-(sel & di_r_mux | -sel & rd_slice__di) & key Equation 5b 

sbox_di = (-sel & rd_slice_di & -key) | (sel & di_r_mux & -key) | 

-(sel & di_r_mux) & -(-sel & rd__slice_di) & key Equation 5c 

20 

sbox_di = rd_slice_di & (-sel & -key) | 

(-sel & rd„slice_di | -(-sel & rd_slice_di))& (sel & di_r_mux & -key) | 
(sel | - rd_slice_di) & (-sel | - di_r_mux) & key Equation 5d 

25 sbox_di = rd_slice_di & (-sel & -key) | 

(-sel & rd_slice_di) & (sel & di jr_mux & -key) | 
(sel | -rd_slice_di)& (sel & di_r_mux & -key) | 

(sel ] - rd_slice_di) & (-sel | - di_r„mux) & key Equation 5e 

30 sbox_di = rd_slice_di & (-sel & -key) | 

(-sel & rd__slice_di) & (sel & dLr_mux & -key) | 
~rd_slice_di & (sel & di_r__mux) A key | 

sel & (sel & di_r_mux) A key Equation 5f 
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The following term of Equation 5f is logically equivalent to 0: ~rd_slice_jii & 
(sel & di_r_mux) A key | sel & (sel & di_r_mux) A key. Substituting in 0 yields 
Equation 6. 

5 

sbox_di = rd_slice_di & (~sel & -key) | ~rd_slice_di & ((sel & di_r_mux) A 
key) | sel & ((sel & di_r_mux) A key) Equation 6 

By substituting the following values, equation 7 can be derived. 

10 

inpl = -sel & -key 

inp2 = ((sel & di_r__mux) A key) 

inp3 = sel & ((sel & di jrjmux) A key) 

sbox_di = rd_slice_di & inpl | ~rd„slice_di & inp2 | inp3 Equation 7 

15 

Equation 7 is represented in Figure 7. Figure 7 is a diagrammatic 
representation of one bit-sliced implementation in accordance with one embodiment 
of the present invention, where the XOR at 621 is decomposed into an OR and 
supplementary logic. It should be noted that the key here is provided to inpl 705, 

20 inp2 707, and inp3 709. Both inpl 705 and inp2 707 are inputs to second level 
multiplexer 711. The value rd„slice is the control signal selecting either inpl 705 or 
inp2 707 at second level multiplexer 711. The output of multiplexer 711 in register 
713 is combined with an OR with the value in register 715. The output is sbox_di 
corresponding to the sbox_di value of Figure 6. Providing sbox_di to Sbox 717 

25 yields the same resulting sbox_do value that the implementation in Figure 6 would 
yield. The resulting DES operations can be designed to yield the same cryptographic 
results. 

As will be appreciated by one of skill in the art, other logic implementations 
30 can be used to improve cryptography processing performance in accordance with the 
present invention. In one implementation, a custom cell combines the OR gate with 
the registers 713 and 715. In another example, the XOR function 721 is decomposed 
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into SBOX logic 717. According to various embodiments, a 500 MHz design is 
achieved for DES processing. 

Another conventional performance limitation relates to key scheduling. In 
5 each round of DES processing, the key scheduler generates the round keys based on 
the original key based and the round counter. As the clock rate approaches 500 MHz, 
however, the key scheduling at each round becomes timing critical. For triple DES, 
one of three keys is pre-selected and a round key of one of the three keys is provided 
at each DES round. Pipelining is contemplated to allow the key to be provided in a 
10 high-speed DES environment. Key scheduling operations can be separated into 
different stages based on logical arrangements. In one embodiment, key scheduling is 
separated into operational stages that use approximately the same amount of time to 
perform their operations. In another embodiment, key scheduling separated into 
different stages for convenience. 

15 

In one specific embodiment, key scheduling is separated into four different 
stages. In the first stage, also called the determination stage, a shifting amount for 
performing the shift function is generated based on the round counter value. As noted 
above, the round keys are generated based on the original key and the round counter. 
20 In the second stage, also called the shift stage, the actual shift function is performed. 

In the third stage, also called the prorogation stage, the round is passed from 
the key scheduler to the round logic. The round key is registered at register 623 in 
Figure 6 or is input at inpl 705, inp2 707, and inp3 at 709. In the fourth stage also 
25 called the consumption stage, the round key is used in the corresponding DES round. 
As shown in table 1 below, the key scheduler takes advantage of pipeline processing. 



Table 1. 



Dl 


SI 


PI 


CI 






D2 


S2 


P2 


C2 






D3 


S3 


P3 
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During the determination of the second key, a first key is shifted. During the 
shift of the second key, a first key is propagated and a third key is determined. 
During the consumption of the first key, a second key is propagated and a third key is 
shifted. According to various embodiments, pipelined design separates key 
scheduling into four stages. The pipeline design improves upon conventional 
techniques that describe generating the key and subsequently consuming the key. 

While the invention has been particularly shown and described with reference 
to specific embodiments thereof, it will be understood by those skilled in the art that 
changes in the form and details of the disclosed embodiments may be made without 
departing from the spirit or scope of the invention. For example, embodiments of the 
present invention may be employed with a variety of encryption algorithms and 
should not be restricted to the ones mentioned above. Therefore, the scope of the 
invention should be determined with reference to the appended claims. 
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